Chapter 5: Correlation and Regression - Review Exercises
| Student ID | Study time (x) | Exam score (y) |
|---|---|---|
| 1 | 5 | 65 |
| 2 | 8 | 72 |
| 3 | 12 | 85 |
| 4 | 6 | 68 |
| 5 | 10 | 80 |
| 6 | 7 | 74 |
| 7 | 15 | 92 |
| 8 | 9 | 78 |
| 9 | 11 | 88 |
| 10 | 4 | 60 |
The data points show a positive correlation between study time and exam scores: as study time increases, exam scores generally increase as well.
First, calculate the necessary statistics:
\[n = 10,\quad \sum x = 87,\quad \sum y = 762,\quad \sum xy = 6937,\quad \sum x^2 = 861,\quad \sum y^2 = 59026\]
Compute the correlation coefficient r:
\[r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]
\[r = \frac{10\times6937 - 87\times762}{\sqrt{[10\times861 - 87^2][10\times59026 - 762^2]}}\]
\[r = \frac{69370 - 66294}{\sqrt{[8610 - 7569][590260 - 580644]}}\]
\[r = \frac{3076}{\sqrt{1041 \times 9616}}\]
\[r = \frac{3076}{\sqrt{10010256}}\]
\[r \approx \frac{3076}{3163.90}\]
\[r \approx 0.972\]
The correlation coefficient r ≈ 0.972 indicates a very strong positive linear relationship between study time and exam scores: students who studied longer tended to score higher. Because r is symmetric in x and y, it measures association only and does not by itself show that more study time causes higher scores.
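As a cross-check on the hand calculation, the short Python sketch below recomputes r from the table using the same raw-sum formula. The names `hours`, `scores`, and `pearson_r` are illustrative choices, not part of the exercise.

```python
from math import sqrt

# Data copied from the study-time table above.
hours = [5, 8, 12, 6, 10, 7, 15, 9, 11, 4]
scores = [65, 72, 85, 68, 80, 74, 92, 78, 88, 60]


def pearson_r(xs, ys):
    """Pearson correlation via the raw-sum formula used in this chapter."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator


r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

A value outside the interval [-1, 1] would signal an arithmetic slip in one of the sums, so this check is worth running whenever the sums are computed by hand.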
| Month | Advertising spend (x, thousand yuan) | Sales revenue (y, ten thousand yuan) |
|---|---|---|
| 1 | 2 | 5 |
| 2 | 3 | 7 |
| 3 | 4 | 8 |
| 4 | 5 | 10 |
| 5 | 6 | 12 |
| 6 | 7 | 14 |
First, calculate the necessary statistics:
\[n = 6,\quad \sum x = 27,\quad \sum y = 56,\quad \sum xy = 283,\quad \sum x^2 = 139\]
Compute the slope b:
\[b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}\]
\[b = \frac{6\times283 - 27\times56}{6\times139 - 27^2}\]
\[b = \frac{1698 - 1512}{834 - 729}\]
\[b = \frac{186}{105}\]
\[b \approx 1.7714\]
Compute the intercept a:
\[a = \bar{y} - b\bar{x}\]
\[a = \frac{56}{6} - \frac{186}{105}\times\frac{27}{6}\]
\[a \approx 9.3333 - 7.9714\]
\[a \approx 1.3619\]
Therefore, the regression line is: \[y = 1.3619 + 1.7714x\]
The regression coefficient b ≈ 1.7714 means that each additional 1 thousand yuan of advertising spend is associated with an average increase of about 1.7714 ten thousand yuan in sales revenue. This indicates a clear positive relationship between advertising spend and sales revenue over the observed months.
When x = 9 thousand yuan, substituting into the regression equation:
y = 1.3619 + 1.7714 × 9 = 1.3619 + 15.9426 = 17.3045 ten thousand yuan
Therefore, the predicted sales revenue is approximately 17.30 ten thousand yuan. Note that x = 9 lies outside the observed range of 2 to 7 thousand yuan, so this prediction is an extrapolation.
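The slope, intercept, and prediction can be reproduced with a few lines of Python. This is a minimal sketch of the least-squares formulas above; `fit_line`, `ad_spend`, and `sales` are names chosen here for illustration.

```python
# Data copied from the advertising table above
# (x in thousand yuan, y in ten thousand yuan).
ad_spend = [2, 3, 4, 5, 6, 7]
sales = [5, 7, 8, 10, 12, 14]


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
    a = sum_y / n - b * sum_x / n  # a = mean(y) - b * mean(x)
    return a, b


a, b = fit_line(ad_spend, sales)
predicted = a + b * 9  # x = 9 is outside the observed range 2..7

print(f"b = {b:.4f}, a = {a:.4f}")
print(f"predicted sales at x = 9: {predicted:.2f} (ten thousand yuan)")
```

Working with unrounded coefficients, as the code does, avoids the small rounding drift that builds up when a and b are rounded to four decimals before being substituted.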
This correlation does not mean that eating ice cream causes drowning. Correlation only indicates a statistical association between two variables; it does not by itself justify a causal conclusion. Here the positive correlation is more plausibly produced by another factor than by ice cream consumption directly causing drowning.
The most likely lurking variable is temperature (or season). In summer, when temperatures are higher, people are more likely to buy ice cream and also more likely to go swimming, which raises the risk of drowning. Temperature is therefore a common cause of both variables and produces the positive correlation between them.
To establish a causal relationship, the following conditions need to be met:
In the ice cream and drowning example, we can analyze drowning data at different temperatures, or examine the relationship between ice cream consumption and drowning after controlling for temperature, to judge whether a genuine causal relationship exists.
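To make "controlling for temperature" concrete, the sketch below uses purely synthetic data, generated only for illustration and not taken from any real study. Temperature drives both simulated ice cream sales and simulated drownings, and the two have no direct link. The first-order partial correlation formula then removes the linear effect of temperature. All coefficients and noise levels are arbitrary assumptions.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation in deviation form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def partial_r(xs, ys, zs):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Synthetic data: temperature is the only common cause (assumed model).
rng = random.Random(0)
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [3.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

raw = pearson_r(ice_cream, drownings)
controlled = partial_r(ice_cream, drownings, temperature)

print(f"raw correlation:            {raw:.3f}")
print(f"controlling for temperature: {controlled:.3f}")
```

In this simulated setting the raw correlation is strongly positive while the partial correlation is close to zero, which is the pattern expected when a common cause explains the association. With real data, a near-zero partial correlation would support the confounding explanation but would not prove it.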
| Child | Age (x, years) | Height (y, cm) |
|---|---|---|
| 1 | 2 | 85 |
| 2 | 3 | 90 |
| 3 | 4 | 98 |
| 4 | 5 | 105 |
| 5 | 6 | 112 |
| 6 | 7 | 118 |
| 7 | 8 | 125 |
| 8 | 9 | 132 |
First, calculate the necessary statistics:
\[n = 8,\quad \sum x = 44,\quad \sum y = 865,\quad \sum xy = 5043,\quad \sum x^2 = 284,\quad \sum y^2 = 95471\]
Compute the correlation coefficient r:
\[r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]
\[r = \frac{8\times5043 - 44\times865}{\sqrt{[8\times284 - 44^2][8\times95471 - 865^2]}}\]
\[r = \frac{40344 - 38060}{\sqrt{[2272 - 1936][763768 - 748225]}}\]
\[r = \frac{2284}{\sqrt{336 \times 15543}}\]
\[r = \frac{2284}{\sqrt{5222448}}\]
\[r \approx \frac{2284}{2285.27}\]
\[r \approx 0.999\]
Compute the regression coefficient b:
\[b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}\]
\[b = \frac{8\times5043 - 44\times865}{8\times284 - 44^2}\]
\[b = \frac{40344 - 38060}{2272 - 1936}\]
\[b = \frac{2284}{336}\]
\[b \approx 6.7976\]
Compute the intercept a:
\[a = \bar{y} - b\bar{x}\]
\[a = \frac{865}{8} - \frac{2284}{336}\times\frac{44}{8}\]
\[a \approx 108.125 - 37.3869\]
\[a \approx 70.7381\]
Therefore, the regression line is: \[y = 70.7381 + 6.7976x\]
When x = 10 years, substituting into the regression equation:
y = 70.7381 + 6.7976 × 10 = 70.7381 + 67.976 = 138.7141 cm
Therefore, the predicted height of a 10-year-old child is approximately 138.71 cm.
The reliability of this prediction is affected by the following factors:
Overall, this prediction is reasonable at the population level but may not be precise for individual children.
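The fit and the prediction can be checked in Python, and the same sketch can flag whether a requested age falls outside the ages actually observed, which is one reason to treat a prediction with caution. The helper names `fit_line` and `predict` are illustrative assumptions, not part of the exercise.

```python
# Data copied from the age/height table above (years, cm).
ages = [2, 3, 4, 5, 6, 7, 8, 9]
heights = [85, 90, 98, 105, 112, 118, 125, 132]


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = fit_line(ages, heights)


def predict(age):
    """Return (predicted height, True if age is outside the observed range)."""
    is_extrapolation = age < min(ages) or age > max(ages)
    return a + b * age, is_extrapolation


height_10, extrapolated = predict(10)
print(f"y = {a:.4f} + {b:.4f}x")
print(f"predicted height at age 10: {height_10:.2f} cm")
print(f"extrapolation beyond observed ages: {extrapolated}")
```

Age 10 lies just beyond the largest observed age of 9, so the linear trend is being extended slightly past the data rather than interpolated within it.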